In this document, we carry out a simulation exercise to evaluate the inference properties of different research designs aimed at measuring the short-term effects of air pollution on health. We consider several types of quasi-experiments and associate an identification strategy with each.
This document focuses on running the simulations. The results are analyzed and discussed in another document.
We use data from 18 cities in France over the 2013-2018 period. The data set contains records of hospital admissions and deaths, mean concentration data for various air pollutants, a set of weather variables, and calendar control variables (such as school holidays). All variables are at the daily and city level: there is therefore a unique observation per date and per city in the data set.
Note that, for now, we focus on quasi-experiments for which treatment is binary and homogeneous, both in time and across individuals. It is also not correlated with covariates (except in the case of air pollution alerts, where it is correlated with the pollutant level).
Here, we consider interventions leading to changes in air pollution levels on some random days. Examples of such interventions include transportation strikes. Of course, such dates are usually not random and are likely to be correlated with unobserved variables. In the present setting, we first consider the gold standard case in which these days are actually chosen at random. One can think of this case as a Randomized Controlled Trial: it represents what would happen if an experimenter could implement a treatment increasing pollution on random days.
The effect of the treatment is estimated using an RCT type of model. The overall idea of the RCT is to compare the average number of deaths or hospital admissions in cities with treatment to cities with no treatment on the same day, controlling for differences across cities.
Here, we consider interventions leading to changes in air pollution levels in all cities after a given date. Examples of such interventions include a national air pollution reduction policy or a national change in regulations leading to an increase in pollution levels.
The effect of the treatment is estimated using an ITS type of model. The overall idea of the ITS is to compare the average number of deaths or hospital admissions before and after treatment.
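As an illustration of this idea, here is a minimal self-contained sketch of an ITS regression on simulated toy data, using the treated + t + t_post specification adopted later in this document. All numbers are hypothetical, and base R's lm() stands in for the feols() call used below to keep the sketch dependency-free.

```r
set.seed(1)
# toy daily series: a level shift of +5 deaths after day 100,
# plus a slow pre-existing time trend (all values hypothetical)
t <- 0:199
treated <- t >= 100
t_post <- pmax(0, t - 100) # time elapsed since treatment started
deaths <- rpois(200, lambda = 30 + 0.01 * t + 5 * treated)
# ITS: compare levels (and trends) before vs after the treatment date
its_fit <- lm(deaths ~ treated + t + t_post)
coef(its_fit)[["treatedTRUE"]] # estimated level shift at the treatment date
```

The t_post term lets the post-treatment trend differ from the pre-treatment one, so the treated coefficient captures the jump at the treatment date.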
Here, we consider interventions leading to changes in air pollution levels in a subset of cities after a given date. Examples of such interventions include an air pollution reduction policy at a sub-national level or a change in regulations at a sub-national level leading to an increase in pollution level.
The effect of the treatment is estimated using a DID type of model. The overall idea of the DID is to compare the evolution of the average number of deaths or hospital admissions in treated cities to its evolution in untreated cities, before and after the treatment date.
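The DID logic can be sketched on toy data, using the treated + post + city_treated specification adopted later in this document (hypothetical values; base R's lm() stands in for the feols() call used below):

```r
set.seed(2)
# two hypothetical cities; city B becomes treated after day 100
toy <- expand.grid(day = 1:200, city = c("A", "B"))
toy$post <- toy$day > 100                  # after the intervention date
toy$city_treated <- toy$city == "B"        # ever-treated city
toy$treated <- toy$post & toy$city_treated # actually treated observations
toy$deaths <- rpois(nrow(toy), 30 + 3 * toy$treated)
# DID: the treated coefficient nets out city and period differences
did_fit <- lm(deaths ~ treated + post + city_treated, data = toy)
```

The post and city_treated dummies absorb period-wide and city-wide differences, so the treated coefficient isolates the treatment effect.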
Here, we consider interventions leading to changes in air pollution levels in all cities, but with the change happening at a different date in each city. Examples of such interventions include a national air pollution reduction policy whose implementation is rolled out progressively.
The effect of the treatment is also estimated using a DID type of model.
Here, we consider interventions that affect exposure to air pollution when air pollution levels reach a given threshold. Examples of such interventions include air pollution alerts: when pollution reaches a certain level, alerts are released, inviting people to reduce their exposure.
The effect of the treatment is estimated using an RD type of model. The overall idea of the RD is to compare days just below the threshold to days just above it (where exposure, and thus health impacts, are lower). The assumption is that days just below and just above the threshold are comparable.
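A minimal sketch of this comparison on simulated toy data (hypothetical numbers; alerts are assumed to reduce deaths, and we keep only days within a bandwidth around the threshold):

```r
set.seed(3)
pm10 <- runif(2000, 0, 100)  # hypothetical pollutant concentrations
alert <- pm10 >= 80          # alert released above the threshold
deaths <- rpois(2000, 30 + 0.05 * pm10 - 2 * alert)
# RD: compare days just below vs just above the threshold,
# keeping only observations within a bandwidth of 10 units
rd_data <- data.frame(deaths, alert, pm10)[abs(pm10 - 80) < 10, ]
rd_fit <- lm(deaths ~ alert + pm10, data = rd_data)
```

Restricting to the bandwidth makes days on both sides of the threshold comparable, while controlling for the running variable (the pollutant level) absorbs the remaining smooth variation.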
To sum up, the aim of the analysis is to measure the inference properties of different research designs aimed at measuring the short-term effects of air pollution on health. For simplicity, we consider the daily number of deaths as the outcome variable of interest for now. We proceed as follows:
Note that there are several parameters we can vary in order to evaluate the performance of the different methods: the identification strategy, the number of observations, the proportion of treated days, the true effect size and the model. In order to limit the number of simulations and for clarity, we only modify one parameter at a time, keeping the others constant and equal to a baseline value. When we vary a parameter, we consider the following values:
We consider this set of parameters for each quasi-experiment.
In addition to considering an “ideal” case, in which treatment is allocated randomly, its effect is homogeneous, and compliance is perfect, we can consider deviations from this “ideal” case:
| Allocation | Compliance | Effect of the treatment |
|---|---|---|
| Random | Yes | Homogeneous |
| Random | No | Homogeneous |
| Correlated with covariates | Yes | Homogeneous |
| Random | Yes | Heterogeneous |
Deviating from the ideal case therefore yields a very large number of cases. We limit ourselves to a sample of these cases.
In this section, we follow the steps described in the section “analysis” and carry out our analysis, defining a function for each step.
We load the packages and the data and wrangle it into a format well suited for this analysis.
First, we create a function to randomly draw a study period of a given length. For simplicity, we choose to have the same study period for each city. This also seems realistic; a study focusing on several cities would probably consider a unique study period.
This function randomly selects a starting date for the study, early enough so that the study can actually last the number of days chosen, and returns a boolean vector indicating whether each date is in the study or not.
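The chunk defining this function is not shown here; a minimal sketch consistent with the description above (the actual implementation may differ) could look like this, assuming daily data where every city shares the same sequence of dates:

```r
# Sketch of draw_study_period (hypothetical implementation):
# draws a random start date early enough for the study to last
# n_days_study days, and flags the dates falling in the study period.
draw_study_period <- function(date, n_days_study) {
  dates <- sort(unique(date))
  possible_starts <- dates[seq_len(length(dates) - n_days_study + 1)]
  date_start <- sample(possible_starts, 1)
  date >= date_start & date < date_start + n_days_study
}
```

The returned boolean vector has the same length as the input, so it can be used directly inside mutate() and filter().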
Then, we create a function to draw the treatment. This function returns a boolean vector, stating whether each observation is in the treatment group or not.
For each treatment, we define the proportion of treated units (p_treat). The different treatments correspond to different types of quasi-experiments. Note that some identification strategies we consider only use restricted samples (RDD and RDIT). For coding simplicity, to get an accurate proportion of treated units, we directly restrict the sample in this function. For the RDIT, we create another type of quasi-experiment (“national_short”), even though the quasi-experiment is actually the same as for a national intervention.
In this function, we made some simulation decisions:

- For the local quasi-experiment, treated cities are only treated during the second half of the study period, so to obtain a proportion p_treat of treated observations we treat 2 × p_treat cities. Note that it does not make sense to have a large p_treat, as this would correspond to the national case. We therefore limit p_treat to be smaller than 0.5.
- For the alert quasi-experiments, only p_treat observations of the total sample are treated: observations outside the bandwidth around the threshold get a NA for treated. Thus, after calling this function, we need to filter out observations with a NA for treated.
- For the rolling quasi-experiment, when drawing each city's treatment start date, we do not draw this date from the whole sample. We only consider dates close enough to the end of the study period such that, on average, only p_treat of the data will be treated.

draw_treated <- function(data, p_treat = 0.5, quasi_exp = "random_days") {
if (quasi_exp == "random_days") {
treated <- rbernoulli(length(data[["date"]]), p_treat)
} else if (quasi_exp == "national") {
num_date <- as.numeric(data[["date"]])
treated <- (num_date >= quantile(num_date, 1 - p_treat))
} else if (quasi_exp %in% c("GAM", "IV")) {
# treated <- TRUE
treated <- rbernoulli(length(data[["date"]]), p_treat)
treated <- ifelse(treated, treated, NA)
} else if (quasi_exp == "local") {
treated_cities <- unique(data[["city"]]) %>%
sample(size = round(length(.) * min(p_treat * 2, 1)))
treated <- (data[["city"]] %in% treated_cities &
data[["date"]] >= median(data[["date"]]))
} else if (str_starts(quasi_exp, "alert")) {
pollutant <- str_extract(quasi_exp, "(?<=_).+")
threshold_pos <- rbeta(1, 20, 2)
treated <- data %>%
group_by(.data$city) %>%
mutate(
threshold = quantile(.data[[pollutant]], threshold_pos, names = FALSE),
treated = (.data[[pollutant]] >= threshold),
bw = cut_width(
.data[[pollutant]],
width = length(.) * 2 * p_treat,
center = unique(threshold) #threshold center of an interval
),
treated = ifelse(
threshold > as.numeric(str_extract(bw, "([:digit:]|\\.|-)+(?=,)")) &
threshold < as.numeric(str_extract(bw, "(?<=,)([:digit:]|\\.)+")), treated, NA)
) %>%
ungroup() %>%
.$treated
} else if (quasi_exp == "rolling") {
treated <- data %>%
group_by(.data$city) %>%
mutate(
treated = (.data$date > sample(
seq.Date(
min(.data$date) + (1 - 2*p_treat)*length(.data$date),
max(.data$date), "day"
), 1))
) %>%
ungroup() %>%
.$treated
} else if (quasi_exp == "national_short") {
dates <- unique(data[["date"]])
bw <- floor(min(p_treat, 0.5) * length(dates))
date_start_treat <- sample(seq.Date(min(dates) + bw, max(dates) - bw, "day"), 1)
treated <- (data[["date"]] >= date_start_treat)
treated <- ifelse(
data[["date"]] > date_start_treat + bw | data[["date"]] < date_start_treat - bw,
NA, treated
)
}
return(treated)
# data <- data %>% mutate(treated = treated)
# return(data)
}
Both to verify that everything works well and for illustration, we can make quick plots:
We then create the potential outcome if a unit is treated, Y(1).
create_y1 <- function(dep_var,
percent_effect_size = 0.5,
quasi_exp = "random_days",
pollutant = NULL) {
# y1 <- output_var*(1 + percent_effect_size/100)
if (str_starts(quasi_exp, "random|national|local|alert|rolling")) {
y1 <- dep_var + rpois(length(dep_var), dep_var * percent_effect_size / 100) %>%
suppressWarnings() #warnings when is.na(dep_var), e.g. rpois(1, NA)
} else if (quasi_exp == "GAM") {
y1 <- dep_var + rpois(length(dep_var), pollutant * percent_effect_size / 100) %>%
suppressWarnings()
}
return(y1)
}
We can then estimate our model and retrieve the point estimate and p-value. The model should be specified in a three-part formula as follows: y ~ x | fixed effects | cluster. If one does not want to set fixed effects, a 0 should be put in the second part of the formula: y ~ x | 0 | cluster.
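For instance, the parsing used inside estimate_model can be sketched with the Formula package (the variable names y, x, and cl are placeholders):

```r
library(Formula)
# a three-part formula: outcome ~ regressors | fixed effects | cluster
fml <- as.Formula(y ~ x | 0 | cl)
formula(fml, lhs = 0, rhs = 3)  # third part only: the cluster (~cl)
formula(fml, rhs = -3)          # drop the third part: the estimation formula
```

Negative rhs indices drop the corresponding part, which is how the cluster specification is separated from the formula actually passed to the estimator.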
estimate_model <- function(data, formula) {
#get the different parameters from the formula
fml <- Formula::as.Formula(formula)
cluster <- formula(fml, lhs = 0, rhs = 3) %>%
suppressWarnings() #when no cluster provided, warning
actual_fml <- formula(fml, rhs = -3)
se <- ifelse(cluster == ~0, "hetero", "cluster")
#run the estimation
est_results <- data %>%
feols(
data = .,
fml = actual_fml,
cluster = cluster,
se = se
)
#retrieve the useful info
nobs <- length(est_results$residuals)
est_results %>%
broom::tidy(conf.int = TRUE) %>%
filter(term == "treatedTRUE") %>%
rename(p_value = p.value) %>%
select(estimate, p_value) %>%
mutate(n_obs = nobs)
}
We then create a function running all the previous functions together and therefore performing an iteration of the simulation for a given set of parameters. This function returns a one row data set with estimate, p-value, number of observations and true effect size.
compute_simulation <- function(data,
n_days_study = 200,
p_treat = 0.5,
quasi_exp = "random_days",
percent_effect_size = 0.5,
formula = "deaths_all_causes ~ treated") {
fml <- Formula::as.Formula(formula)
dep_var <- paste(fml[[2]])
sim_data <- data %>%
mutate(study_period = draw_study_period(date, n_days_study)) %>%
filter(study_period) %>%
select(-study_period) %>%
mutate(
treated = draw_treated(., p_treat, quasi_exp),
# true_treated = draw_treated(., p_treat, quasi_exp),
y0 = .data[[dep_var]],
y1 = create_y1(.data[[dep_var]], percent_effect_size),
yobs = y1 * treated + y0 * (1 - treated)
# yobs = y1*true_treated + y0*(1 - true_treated)
) %>%
filter(!is.na(treated)) #not necessary bc dropped in lm()
# filter(!is.na(true_treated)) #not necessary bc dropped in lm()
#for the local intervention, we estimate a DID
#and thus need a post and a city_treated variable
if (quasi_exp == "local") {
sim_data <- sim_data %>%
group_by(city) %>%
mutate(city_treated = as.logical(max(treated))) %>%
ungroup() %>%
group_by(date) %>%
mutate(post = as.logical(max(treated))) %>%
ungroup()
}
#for the national intervention, we estimate an ITS
#and thus need a time index and a post-treatment time index
if (str_starts(quasi_exp, "national")) {
sim_data <- sim_data %>%
group_by(city) %>%
mutate(
t = as.numeric(date) - min(as.numeric(date)),
t_post = pmax(0, as.numeric(date) - as.numeric(min(.data$date[treated == TRUE])))
) %>%
ungroup()
}
sim_output <- sim_data %>%
estimate_model(formula = update(fml, yobs ~ .)) %>%
mutate(true_effect = mean(sim_data$y1 - sim_data$y0, na.rm = TRUE))
return(sim_output)
}
test <- total_data %>%
compute_simulation(
formula = "deaths_na_causes ~ treated",
n_days_study = 700,
quasi_exp = "national_short"
)
We will then loop this function to get a large number of replications of the simulation for each set of parameters. We will also vary the values of the different parameters.
We can then run the simulations. But first, we define the set of parameters we want to consider.
We create a table, sim_param, displaying in each row a set of parameters for which we want to run a simulation. We will then map our function compute_simulation on this table.
To build sim_param, we first define a set of baseline values for our parameters and store them in a data frame, sim_param_base. We will then vary the values of the parameters one after the other. We thus create vectors containing the different values of the parameters we want to test.
sim_param_base <- tibble(
n_days_study = 500,
p_treat = 0.2,
percent_effect_size = 0.5,
formula = "deaths_na_causes ~ treated + temperature + I(temperature^2) | city + month + year + day_of_week"
)
write_csv(sim_param_base, "../Outputs/sim_param_base.csv")
vect_n_days_study <- c(100, 250, 500, 750, 1000)
vect_p_treat <- c(0.01, 0.2)
vect_percent_effect_size <- c(0.1, 0.5, 1)
vect_formula <- c(
"deaths_na_causes ~ treated",
"deaths_na_causes ~ treated | city",
"deaths_na_causes ~ treated | city + month + year + day_of_week",
"deaths_na_causes ~ treated + temperature + I(temperature^2) | city + month + year + day_of_week"
)
We then want to create the actual table, varying parameters one after the other. To do so, we define a simple function, add_values_param, which adds to the table the values of a parameter contained in a vector. We can then map this function over all the vectors of parameters of interest.
#adds all values in vect_param
add_values_param <- function(df, vect_param) {
param_name <- str_remove(vect_param, "vect_")
tib_param <- tibble(get(vect_param))
names(tib_param) <- param_name
df %>%
full_join(tib_param, by = param_name) %>%
fill(everything())
}
vect_of_vect_param <- c("vect_n_days_study", "vect_p_treat", "vect_percent_effect_size", "vect_formula")
sim_param_unique <-
map_dfr(
vect_of_vect_param,
add_values_param,
df = sim_param_base
) %>%
distinct()
We want to compute our simulations for this set of parameters for every quasi experiment. Note that, in order to identify the effect of interest, we consider different types of models, depending on the identification strategy considered for each quasi experiment.
Then, for each set of parameters we want to run many iterations of the simulation, n_iter. We thus modify sim_param so that each set of parameters appears n_iter times. This will enable us to loop compute_simulation directly on sim_param.
vect_quasi_exp <- c("random_days", "national", "local", "national_short", "rolling") #, "alert_pm10"
n_iter <- 2
sim_param <- sim_param_unique %>%
crossing(vect_quasi_exp) %>%
rename(quasi_exp = vect_quasi_exp) %>%
mutate(
formula = case_when(
quasi_exp == "local" ~ str_replace_all(formula, "treated",
"treated + post + city_treated"),
str_starts(quasi_exp, "national") ~ str_replace_all(
formula, "treated",
"treated + t + t_post"),
str_starts(quasi_exp, "alert") ~ str_replace_all(formula, "treated", paste(
"treated + ", str_remove(quasi_exp, "alert_")
)),
TRUE ~ formula
)
) %>%
crossing(rep_id = 1:n_iter) %>%
select(-rep_id) %>%
mutate(sim_id = row_number())
# write_csv(sim_param, "../Outputs/sim_param.csv")
We can then run the simulations for each set of parameter, using a pmap_dfr function.
tic()
simulations <- future_pmap_dfr(
sim_param %>% select(-sim_id),
compute_simulation,
data = total_data,
.id = "sim_id",
.options = furrr_options(seed = TRUE)
)
toc()
all_simulations <- simulations %>%
as_tibble() %>%
mutate(sim_id = as.numeric(sim_id)) %>%
left_join(sim_param, by = "sim_id") %>%
select(-sim_id)
# saveRDS(all_simulations, "../Outputs/data_simulations.RDS")
We then summarize our results, computing power, type M error and so on for each set of parameters using the function summarise_simulations. Note that this function can only take as input a data frame produced by compute_simulation (or a mapped version of this function).
summarise_simulations <- function(data) {
data %>%
group_by(formula, quasi_exp, n_days_study, p_treat, percent_effect_size) %>%
summarise(
power = mean(p_value <= 0.05, na.rm = TRUE)*100,
bias = mean(abs((estimate - true_effect)/true_effect), na.rm = TRUE),
average_p_value = mean(p_value, na.rm = TRUE),
average_n_obs = mean(n_obs, na.rm = TRUE),
type_m = mean(ifelse(p_value < 0.05, abs(estimate)/true_effect, NA), na.rm = TRUE),
type_s = sum(ifelse(p_value < 0.05, sign(estimate) != sign(true_effect), NA), na.rm = TRUE)/n()*100,
.groups = "drop"
)
}
We then run this function.
summary_simulations <- summarise_simulations(all_simulations)
# saveRDS(summary_simulations, "../Outputs/summary_simulations.RDS")